Reading Mass Spec data files

You can read several files at the same time using the merger module. But for now, we will read only a single file.

Read a CSV file and get descriptive information about the mass spec data using the MassSpecReader class from the msdas.readers module.


In [1]:
from msdas import *
from msdas import yeast
%pylab inline


Couldn't import dot_parser, loading of dot files will not be possible.
Populating the interactive namespace from numpy and matplotlib

Here are some files to play with. These are 6 files that should be merged. However, we can read them one by one for demonstration.

Reading the data (36 columns of experiments + 7 of metadata)


In [2]:
y = MassSpecReader(yeast.get_yeast_small_data())


INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/YEAST_small_all.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost

Calling the print function gives some basic information about the number of columns, proteins and psite/protein combinations.


In [3]:
print(y)


This dataframe contains 40 columns in addition to the standard columns Protein,
Sequence, Psite
Your data contains 23 unique proteins
Your data contains 57 combination of psites/proteins
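The two counts reported above can be reproduced by hand. The following is a minimal sketch in plain Python, not the msdas implementation: it assumes that a "combination" is a distinct (Protein, Psite) pair, and it only uses the three DIG1 rows displayed in this notebook, so the numbers differ from the full data set. The function name summarise is hypothetical.

```python
# Three (Protein, Psite) pairs copied from the DIG1 rows shown in this notebook.
rows = [
    ("DIG1", "S126+S127"),
    ("DIG1", "S142"),
    ("DIG1", "S272"),
]


def summarise(rows):
    """Return (number of unique proteins, number of unique protein/psite pairs).

    Assumption: a "combination" is a distinct (Protein, Psite) tuple.
    """
    proteins = {protein for protein, _ in rows}
    combinations = set(rows)
    return len(proteins), len(combinations)


n_proteins, n_combinations = summarise(rows)
print("unique proteins: {}".format(n_proteins))
print("psite/protein combinations: {}".format(n_combinations))
```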

The data is stored in the data frame called df. Read-only aliases to the measurements only and to the metadata only are available as the data frames called measurements and metadata, respectively.


In [4]:
y.df.iloc[[0, 1, 2]]  # first three rows; .ix was removed in pandas 1.0


Out[4]:
Protein Sequence Psite Sequence_Phospho a0_t0 a0_t1 a0_t5 a0_t10 a0_t20 a0_t45 ... a20_t45 a45_t0 a45_t1 a45_t5 a45_t10 a45_t20 a45_t45 Entry Entry_name Identifier
0 DIG1 DGNLASSNSAHFPPVANQNVK S126+S127 DGNLAS(Phospho)SNSAHFPPVANQNVK 0.000415 0.000397 0.000671 0.000602 0.000440 0.000418 ... 0.001009 0.000289 0.000300 0.000270 0.000358 0.000367 0.000313 Q03063 DIG1_YEAST DIG1_S126+S127
1 DIG1 SAPAQVTQHSK S142 S(Phospho)APAQVTQHSK 0.000187 0.000185 0.000267 0.000202 0.000138 0.000226 ... 0.001241 0.001144 0.001364 0.001237 0.001091 0.001425 0.001707 Q03063 DIG1_YEAST DIG1_S142
2 DIG1 VNDSYDSPLSGTASTGK S272 VNDSYDS(Phospho)PLSGTASTGK 0.000338 0.000330 0.000538 0.000505 0.000381 0.000328 ... 0.000232 0.000178 0.000208 0.000122 0.000212 0.000203 0.000221 Q03063 DIG1_YEAST DIG1_S272

3 rows × 43 columns


In [5]:
y.measurements.iloc[[0, 1, 2]]  # first three rows; .ix was removed in pandas 1.0


Out[5]:
a0_t0 a0_t1 a0_t5 a0_t10 a0_t20 a0_t45 a1_t0 a1_t1 a1_t5 a1_t10 ... a20_t5 a20_t10 a20_t20 a20_t45 a45_t0 a45_t1 a45_t5 a45_t10 a45_t20 a45_t45
0 0.000415 0.000397 0.000671 0.000602 0.000440 0.000418 0.001416 0.001090 0.000775 0.000668 ... 0.001149 0.000917 0.000902 0.001009 0.000289 0.000300 0.000270 0.000358 0.000367 0.000313
1 0.000187 0.000185 0.000267 0.000202 0.000138 0.000226 0.000177 0.000162 0.000136 0.000142 ... 0.001135 0.000899 0.001064 0.001241 0.001144 0.001364 0.001237 0.001091 0.001425 0.001707
2 0.000338 0.000330 0.000538 0.000505 0.000381 0.000328 0.000367 0.000379 0.000330 0.000372 ... 0.000349 0.000319 0.000314 0.000232 0.000178 0.000208 0.000122 0.000212 0.000203 0.000221

3 rows × 36 columns


In [6]:
y.metadata.iloc[[0, 1, 2]]  # first three rows; .ix was removed in pandas 1.0


Out[6]:
Identifier Protein Sequence Sequence_Phospho Psite Entry Entry_name
0 DIG1_S126+S127 DIG1 DGNLASSNSAHFPPVANQNVK DGNLAS(Phospho)SNSAHFPPVANQNVK S126+S127 Q03063 DIG1_YEAST
1 DIG1_S142 DIG1 SAPAQVTQHSK S(Phospho)APAQVTQHSK S142 Q03063 DIG1_YEAST
2 DIG1_S272 DIG1 VNDSYDSPLSGTASTGK VNDSYDS(Phospho)PLSGTASTGK S272 Q03063 DIG1_YEAST
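To make the 36 + 7 column split concrete, here is a rough sketch of how column names could be separated into measurements and metadata. It is an illustration only and says nothing about how msdas does it internally: it assumes that measurement columns are exactly those named a<number>_t<number>, as in the tables above. The names MEASUREMENT_PATTERN and split_columns are hypothetical.

```python
import re

# Assumption: measurement columns are named a<number>_t<number>
# (a0_t0, a20_t45, ...). Every other column is treated as metadata.
MEASUREMENT_PATTERN = re.compile(r"^a\d+_t\d+$")


def split_columns(columns):
    """Split column names into (measurements, metadata), preserving order."""
    measurements = [c for c in columns if MEASUREMENT_PATTERN.match(c)]
    metadata = [c for c in columns if not MEASUREMENT_PATTERN.match(c)]
    return measurements, metadata


# Rebuild the 43 column names seen in the outputs above.
levels = [0, 1, 5, 10, 20, 45]
columns = ["Protein", "Sequence", "Psite", "Sequence_Phospho"]
columns += ["a{}_t{}".format(a, t) for a in levels for t in levels]
columns += ["Entry", "Entry_name", "Identifier"]

measurements, metadata = split_columns(columns)
print("{} measurement columns, {} metadata columns".format(
    len(measurements), len(metadata)))
```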

Statistics about Phospho sites


In [7]:
y.plot_phospho_stats()


Histogram of peptide sequence lengths


In [8]:
y.hist_peptide_sequence_length()


Visualise the variation in each experiment


In [9]:
y.boxplot() # variation in each experiment


/home/cokelaer/Work/virtualenv/lib64/python2.7/site-packages/pandas/tools/plotting.py:2633: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  warnings.warn(msg, FutureWarning)

Visualise data as a "time series"


In [10]:
y.plot_timeseries('DIG1_S272')


Out[10]:
a0_t0 a0_t1 a0_t5 a0_t10 a0_t20 a0_t45 a1_t0 a1_t1 a1_t5 a1_t10 ... a20_t5 a20_t10 a20_t20 a20_t45 a45_t0 a45_t1 a45_t5 a45_t10 a45_t20 a45_t45
2 0.000338 0.00033 0.000538 0.000505 0.000381 0.000328 0.000367 0.000379 0.00033 0.000372 ... 0.000349 0.000319 0.000314 0.000232 0.000178 0.000208 0.000122 0.000212 0.000203 0.000221

1 rows × 36 columns

Visualise data in a 6 by 6 image (YEAST case only)


In [11]:
y.plot_experiments("DIG1_S272")


WARNING:root:Works with yeast data set only
Out[11]:
0 1 2 3 4 5
a0 0.000338 0.000330 0.000538 0.000505 0.000381 0.000328
a1 0.000367 0.000379 0.000330 0.000372 0.000415 0.000270
a5 0.000505 0.000465 0.000502 0.000473 0.000418 0.000493
a10 0.000521 0.000550 0.000538 0.000629 0.000478 0.000460
a20 0.000243 0.000312 0.000349 0.000319 0.000314 0.000232
a45 0.000178 0.000208 0.000122 0.000212 0.000203 0.000221
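The 6 by 6 layout follows from the column names: each measurement column aX_tY maps to row aX and to the column for tY. The sketch below reshapes the DIG1_S272 values shown above from a flat record into such a grid. It is plain Python for illustration, not the code behind plot_experiments; the names COLUMN and to_grid are hypothetical, and the aX_tY naming scheme is assumed.

```python
import re

COLUMN = re.compile(r"^a(\d+)_t(\d+)$")  # assumed naming scheme: aX_tY

levels = [0, 1, 5, 10, 20, 45]
# DIG1_S272 measurements, copied row by row from the output above.
values = [
    0.000338, 0.000330, 0.000538, 0.000505, 0.000381, 0.000328,
    0.000367, 0.000379, 0.000330, 0.000372, 0.000415, 0.000270,
    0.000505, 0.000465, 0.000502, 0.000473, 0.000418, 0.000493,
    0.000521, 0.000550, 0.000538, 0.000629, 0.000478, 0.000460,
    0.000243, 0.000312, 0.000349, 0.000319, 0.000314, 0.000232,
    0.000178, 0.000208, 0.000122, 0.000212, 0.000203, 0.000221,
]
names = ["a{}_t{}".format(a, t) for a in levels for t in levels]
record = dict(zip(names, values))  # flat record, one entry per column


def to_grid(record):
    """Reshape a flat {column name: value} record into a 2D list.

    Rows are ordered by the number after 'a', columns by the number after 't'.
    Keys that do not match the aX_tY pattern are ignored.
    """
    cells = {}
    for name, value in record.items():
        match = COLUMN.match(name)
        if match is None:
            continue  # not a measurement column
        cells[(int(match.group(1)), int(match.group(2)))] = value
    a_levels = sorted({a for a, _ in cells})
    t_levels = sorted({t for _, t in cells})
    grid = [[cells[(a, t)] for t in t_levels] for a in a_levels]
    return a_levels, t_levels, grid


a_levels, t_levels, grid = to_grid(record)
for a, row in zip(a_levels, grid):
    print("a{:<3}".format(a), " ".join("{:.6f}".format(v) for v in row))
```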

Creating an instance from an existing instance


In [12]:
y2 = readers.MassSpecReader(y, verbose=False)

In [13]:
y2 == y


Out[13]:
True

Reading data with an empty instance of MassSpecReader


In [14]:
y3 = readers.MassSpecReader(verbose=True)
filename = yeast.get_yeast_small_data()
y3.read_csv(filename)


INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/YEAST_small_all.csv

In [15]:
y3 == y


Out[15]:
False

Here, the data read with read_csv differs from y. When a filename is passed to the constructor, the cleanup function is called automatically; read_csv alone does not call it. So, you have to call the cleanup function manually:


In [16]:
y3.cleanup()


INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost

In [17]:
y3 == y


Out[17]:
True
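The log messages give a hint of what cleanup changes. As a rough illustration only, the sketch below mimics two of the logged steps on a single row: replacing zeros with NA, and rebuilding the identifier as Protein_Psite (the format seen in the Identifier column, e.g. DIG1_S272). It is plain Python and not the msdas code; the function cleanup_row is hypothetical, the zero value is invented for the example, and the other logged steps (psite renaming, ambiguous protein names) are not covered.

```python
import math


def cleanup_row(row, measurement_keys):
    """Return a cleaned copy of a row (hypothetical helper, not msdas API).

    - zeros in the measurement columns become NaN
    - Identifier is rebuilt as "<Protein>_<Psite>"
    """
    cleaned = dict(row)  # leave the input untouched
    for key in measurement_keys:
        if cleaned[key] == 0:
            cleaned[key] = float("nan")
    cleaned["Identifier"] = "{}_{}".format(cleaned["Protein"], cleaned["Psite"])
    return cleaned


# a0_t0 is taken from the notebook; the 0.0 in a0_t1 is made up for illustration.
raw = {"Protein": "DIG1", "Psite": "S272", "a0_t0": 0.000338, "a0_t1": 0.0}
cleaned = cleanup_row(raw, ["a0_t0", "a0_t1"])

print(cleaned["Identifier"])
print(math.isnan(cleaned["a0_t1"]))
```

Because the raw row still holds zeros and lacks the rebuilt identifier, a raw and a cleaned copy of the same data compare as different, which is consistent with y3 == y being False before cleanup and True after it.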
